arXiv:1502.07411v6 [cs.CV] 25 Nov 2015




Learning Depth from Single Monocular Images 
Using Deep Convolutional Neural Fields 

Fayao Liu, Chunhua Shen, Guosheng Lin, Ian Reid 


Abstract —In this article, we tackle the problem of depth estimation from single monocular images. Compared with depth estimation 
using multiple images such as stereo depth perception, depth from monocular images is much more challenging. Prior work typically 
focuses on exploiting geometric priors or additional sources of information, most using hand-crafted features. Recently, there is 
mounting evidence that features from deep convolutional neural networks (CNN) set new records for various vision applications. 
On the other hand, considering the continuous characteristic of the depth values, depth estimation can be naturally formulated as 
a continuous conditional random field (CRF) learning problem. Therefore, here we present a deep convolutional neural field model for 
estimating depths from single monocular images, aiming to jointly explore the capacity of deep CNN and continuous CRF. In particular, 
we propose a deep structured learning scheme which learns the unary and pairwise potentials of continuous CRF in a unified deep 
CNN framework. We then further propose an equally effective model based on fully convolutional networks and a novel superpixel
pooling method, which is about 10 times faster, to speed up the patch-wise convolutions in the deep model. With this more efficient
model, we are able to design deeper networks to pursue better performance.

Our proposed method can be used for depth estimation of general scenes with no geometric priors nor any extra information injected. 
In our case, the integral of the partition function can be calculated in a closed form such that we can exactly solve the log-likelihood 
maximization. Moreover, solving the inference problem for predicting depths of a test image is highly efficient as closed-form solutions 
exist. Experiments on both indoor and outdoor scene datasets demonstrate that the proposed method outperforms state-of-the-art 
depth estimation approaches. 

Index Terms —Depth Estimation, Conditional Random Field (CRF), Deep Convolutional Neural Networks (CNN), Fully Convolutional 
Networks, Superpixel Pooling. 


1 Introduction 

Estimating depth information from single monocular images depicting general
scenes is an important problem in computer vision. Many challenging computer
vision problems have proven to benefit from the incorporation of depth
information, for example semantic labelling [1] and pose estimation [2].
Although highly developed depth sensors such as Microsoft Kinect have made
the acquisition of RGBD images affordable, most of the vision datasets
commonly evaluated in the vision community are still RGB images. Moreover,
outdoor applications still rely on LiDAR or other laser sensors because
strong sunlight can cause infrared interference and make depth information
extremely noisy. This has led to wide research interest in the topic of
estimating depths from single RGB images. Unfortunately, it is a notoriously
ill-posed problem, as one captured image scene may correspond to numerous
real world scenarios [3].

Whereas for humans, inferring the underlying 3D structure from a single
image is effortless, it remains a challenging task for automated computer
vision systems, since reliable cues such as temporal information in videos
or stereo correspondences cannot be exploited. Previous work mainly focuses
on enforcing geometric assumptions, e.g., box models, to infer the spatial
layout of a room [4], [5] or outdoor scenes [6]. These models come with
innate restrictions: they can model only particular scene structures and are
therefore not applicable to general scene depth estimation. More recently,
non-parametric methods [7] have been explored, which consist of candidate
image retrieval, scene alignment and then depth inference using optimization
with smoothness constraints. This is based on the assumption that scenes
with semantically similar appearances should have similar depth
distributions when densely aligned. However, this method is prone to
propagating errors through the different decoupled stages and relies heavily
on building a reasonably sized image database to perform the candidate
retrieval. In recent years, efforts have been made towards incorporating
additional sources of information, e.g., user annotations [8] and semantic
labellings [9], [1]. In the recent work of [1], Ladicky et al. have shown
that jointly performing depth estimation and semantic labelling can benefit
each other. Nevertheless, all these methods use hand-crafted features.

Fig. 1 - Examples of depth estimation results using the proposed deep
convolutional neural fields model. First row: NYU v2 dataset; second row:
Make3D dataset. From left to right: input image, ground-truth, our
prediction.

In contrast to previous efforts, here we propose to formulate depth
estimation as a deep continuous Conditional Random Field (CRF) learning
problem, without relying on any geometric priors or any extra information.
CRF [10] is a popular graphical model for structured output prediction.
While extensively studied in classification (discrete) domains, CRF has been
less explored for regression (continuous) problems. One of the pioneering
works on continuous CRF is [11], in which it was proposed for global ranking
in document retrieval. Under certain constraints, the maximum likelihood
optimization can be solved directly, as the partition function can be
analytically calculated. Since then, continuous CRF has been successfully
applied to various structured regression problems, e.g., remote sensing
[12], [13] and image denoising [13]. Motivated by these successes, we here
propose to use it for depth estimation, given the continuous nature of the
depth values, and learn the potential functions in a deep convolutional
neural network (CNN).

Recent years have witnessed the prosperity of the deep CNN [14] since the
breakthrough work of Krizhevsky et al. [15]. CNN features have been setting
new records for a wide variety of vision applications [16]. Despite all the
successes in classification problems, deep CNN has been less explored for
structured learning problems, i.e., joint training of a deep CNN and a
graphical model, which is a relatively new and not well addressed problem.
To our knowledge, no such model has been successfully used for depth
estimation. Here, we bridge this gap by jointly exploring CNN and continuous
CRF, denoting this new method as a deep convolutional neural field (DCNF).

Fully convolutional networks have recently been studied for dense prediction
problems, e.g., semantic labelling [17], [18]. Models based on fully
convolutional networks have the advantage of highly efficient training and
prediction. We here exploit this advance to speed up the training and
prediction of our DCNF model. However, the feature maps produced by fully
convolutional models are typically much smaller than the input image. This
can cause problems for both training and prediction. During training, one
needs to downsample the ground-truth maps, which may lead to information
loss since small objects might disappear. In prediction, the upsampling
operation is likely to degrade performance at object boundaries. We
therefore propose a novel superpixel pooling method to address this issue.
It jointly exploits the strengths of highly efficient fully convolutional
networks and the benefits of superpixels at preserving object boundaries.

To sum up, we highlight the main contributions of this 
work as follows. 

• We propose a deep convolutional neural field (DCNF) model for depth
estimation by exploring CNN and continuous CRF. Given the continuous nature
of the depth values, the partition function in the probability density
function can be analytically calculated; we can therefore directly solve the
log-likelihood optimization without any approximations. The gradients can be
exactly calculated in the back propagation training. Moreover, solving the
MAP problem for predicting the depth of a new image is highly efficient
since closed-form solutions exist.

• We jointly learn the unary and pairwise potentials of the CRF in a unified
deep CNN framework, which is trained using back propagation.

• We propose a faster model based on fully convolutional networks and a
novel superpixel pooling method, which results in about 10 times speedup
while producing similar prediction accuracy. With this more efficient model,
which we refer to as DCNF-FCSP, we are able to design very deep networks for
better performance.

• We demonstrate that the proposed method outperforms state-of-the-art
results of depth estimation on both indoor and outdoor scene datasets.

Preliminary results of our work appeared in Liu et al. [19].

1.1 Related work 

Our method exploits the recent advances of deep nets in image classification
[15], [20], object detection [21] and semantic segmentation [17], [18] for
single view image depth estimation. In the following, we give a brief
introduction to the most closely related work.

Depth estimation Traditional methods [22], [23], [9] typically formulate
depth estimation as a Markov Random Field (MRF) learning problem. As exact
MRF learning and inference are intractable in general, most of these
approaches employ approximation methods, e.g., multi-conditional learning
(MCL) or particle belief propagation (PBP). Predicting the depths of a new
image is inefficient, taking around 4-5s in [23] and even longer (30s) in
[9]. To make things worse, these methods lack flexibility in that [22], [23]
rely on horizontal alignment of images and [9] requires the semantic
labellings of the training data to be available beforehand. More recently,
Liu et al. [24] propose a discrete-continuous CRF model to take into
consideration the relations between adjacent superpixels, e.g., occlusions.
They also need to use approximation methods for learning and maximum a
posteriori (MAP) inference. Besides, their method relies on image retrieval
to obtain a reasonable initialization. In contrast, here we present a deep
continuous CRF model in which we can directly solve the log-likelihood
optimization without any approximations, since the partition function can be
analytically calculated. Predicting the depth of a new image is highly
efficient since a closed-form solution exists. Moreover, we do not inject
any geometric priors or any extra information. On the other hand, previous
methods such as [23] all use hand-crafted features in their work, e.g.,
texton, GIST, SIFT, PHOG, object bank, etc. In contrast, we learn deep CNN
for constructing unary and pairwise potentials of the CRF.

Recently, Eigen et al. [3] proposed a multi-scale CNN approach for depth
estimation, which bears similarity to our work here. However, our method
differs critically from theirs: they use the CNN as a black box by directly
regressing the depth map from an input image through convolutions; in
contrast, we use a CRF to explicitly model the relations of neighboring
superpixels, and learn the potentials (both unary and binary) in a unified
CNN framework.

Recent work of [25] and [26] is relevant to ours in that they also perform
depth estimation from a single image. The method of Su et al. [25] involves
a continuous depth optimization step like ours, which also contains a unary
regression term and a pairwise local smoothness term. However, these two
works focus on 3D reconstruction of known segmented objects, while our
method targets depth estimation of general scene images. Furthermore, the
method of [25] relies on a pre-constructed 3D shape database of input object
categories, and the work of [26] relies on class-specific object keypoints
and object segmentations. In contrast, we do not inject these priors.

Combining CNN and CRF In [27], Farabet et al. propose a multi-scale CNN
framework for scene labelling, which uses CRF as a post-processing step for
local refinement. In the most recent work of [28], Tompson et al. present a
hybrid architecture for jointly training a deep CNN and an MRF for human
pose estimation. They first train a unary term and a spatial model
separately, then jointly learn them as a fine-tuning step. During
fine-tuning of the whole model, they simply remove the partition function in
the likelihood to have a loose approximation. In contrast, our model
performs continuous variable prediction. We can directly solve the
log-likelihood optimization without using approximations, as the partition
function is integrable and can be analytically calculated. Moreover, during
prediction, we have closed-form solutions to the MAP inference problem.
Although no convolutional layers are involved, the work of [29] shares
similarity with ours in that both continuous CRFs use neural networks to
model the potentials. Note that the model in [29] is not deep and only one
hidden layer is used. It is unclear how the method of [29] performs on the
challenging depth estimation problem that we consider here.

Fully convolutional networks Fully convolutional networks have recently been
actively studied for dense prediction problems, e.g., semantic segmentation
[17], [18], image restoration [30], image super-resolution [31] and depth
estimation [3]. To deal with the downsampled output issue, interpolations
are generally applied [3], [18]. In [32], Sermanet et al. propose an input
shifting and output interlacing trick to produce dense predictions from
coarse outputs without interpolations. Later on, Long et al. [17] present a
deconvolution approach to put the upsampling into the training regime
instead of applying it as a post-processing step. The CNN model presented in
Eigen et al. [3] for depth estimation also suffers from this upsampling
problem: the predicted depth maps of [3] are at 1/4 resolution of the
original input image, with some border areas lost. They simply use bilinear
interpolation to upsample the predictions to the input image size. Unlike
these existing methods, we propose a novel superpixel pooling method to
address this issue. It jointly exploits the strengths of highly efficient
fully convolutional networks and the benefits of superpixels at preserving
object boundaries.

2 Deep convolutional neural fields 

We present the details of our deep convolutional neural 
field (DCNF) model for depth estimation in this section. 
Unless otherwise stated, we use boldfaced uppercase 
and lowercase letters to denote matrices and column 
vectors respectively; 0 represents a column vector with 
all elements being 0. 

2.1 Overview 

The goal here is to infer the depth of each pixel in a single image
depicting a general scene. Following previous work, we make the common
assumption that an image is composed of small homogeneous regions
(superpixels) and consider the graphical model composed of nodes defined on
superpixels. Each superpixel is portrayed by the depth of its centroid. Let
$\mathbf{x}$ be an image and $\mathbf{y} = [y_1, \ldots, y_n]^\top \in
\mathbb{R}^n$ be a vector of continuous depth values corresponding to all
$n$ superpixels in $\mathbf{x}$. Similar to conventional CRFs, we model the
conditional probability distribution of the data with the following density
function:

$$\Pr(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(-E(\mathbf{y}, \mathbf{x})), \qquad (1)$$

where $E$ is the energy function; $Z$ is the partition function defined as:

$$Z(\mathbf{x}) = \int_{\mathbf{y}} \exp\{-E(\mathbf{y}, \mathbf{x})\}\, \mathrm{d}\mathbf{y}. \qquad (2)$$

Here, because $\mathbf{y}$ is continuous, the integral in Eq. (2) can be
analytically calculated under certain circumstances, which we will show in
Sec. 2.3. This is different from the discrete case, in which approximation
methods need to be applied. To predict the depth of a new image, we solve
the following MAP inference problem:

$$\mathbf{y}^\star = \operatorname*{argmax}_{\mathbf{y}} \Pr(\mathbf{y}|\mathbf{x}). \qquad (3)$$

Fig. 2 - An illustration of our DCNF model for depth estimation. The input
image is first over-segmented into superpixels. In the unary part, for a
superpixel p, we crop the image patch centred around its centroid, then
resize and feed it to a CNN which is composed of 5 convolutional and 4
fully-connected layers (see Fig. 3 for details). In the pairwise part, for a
pair of neighboring superpixels (p, q), we consider K types of similarities,
and feed them into a fully-connected layer. The outputs of the unary part
and the pairwise part are then fed to the CRF structured loss layer, which
minimizes the negative log-likelihood. Predicting the depths of a new image
x is to maximize the conditional probability Pr(y|x), which has closed-form
solutions (see Sec. 2.3 for details).

Fig. 3 - Detailed network architecture of the unary part in Fig. 2.

We formulate the energy function as a typical combination of unary
potentials $U$ and pairwise potentials $V$ over the nodes (superpixels)
$\mathcal{N}$ and edges $\mathcal{S}$ of the image $\mathbf{x}$:

$$E(\mathbf{y}, \mathbf{x}) = \sum_{p \in \mathcal{N}} U(y_p, \mathbf{x}) + \sum_{(p,q) \in \mathcal{S}} V(y_p, y_q, \mathbf{x}). \qquad (4)$$

The unary term U aims to regress the depth value for a single superpixel.
The pairwise term V encourages neighboring superpixels with similar
appearances to take similar depths. We aim to jointly learn U and V in a
unified CNN framework.

Fig. 2 sketches our deep convolutional neural field model for depth
estimation. The whole network comprises a unary part, a pairwise part and a
continuous CRF loss layer. For an input image, which has been over-segmented
into n superpixels, we consider image patches centred around each superpixel
centroid. The unary part then takes an image patch as input and feeds it to
a CNN whose output is a single number, the regressed depth value of the
superpixel.

The network for the unary part is composed of 5 convolutional and 4
fully-connected layers, with details in Fig. 3. Note that the CNN parameters
are shared across all the superpixels. The pairwise part takes similarity
vectors (each with K components) of all neighboring superpixel pairs as
input and feeds each of them to a fully-connected layer (parameters are
shared among different pairs), then outputs a vector containing all the
1-dimensional similarities for each of the neighboring superpixel pairs. The
continuous CRF loss layer takes the outputs from the unary and the pairwise
terms to minimize the negative log-likelihood. Compared to the direct
regression method in [3], our model possesses two potential advantages:

• We achieve translation invariance as we construct unary potentials
irrespective of the superpixel's coordinate (shown in Sec. 2.2);

• We explicitly model the relations of neighboring superpixels through
pairwise potentials.

In the following, we describe the details of the potential functions
involved in the energy function in Eq. (4).

2.2 Potential functions 

2.2.1 Unary potential 

The unary potential is constructed from the output of a CNN by considering
the least square loss:

$$U(y_p, \mathbf{x}; \boldsymbol{\theta}) = (y_p - z_p(\boldsymbol{\theta}))^2, \quad \forall p = 1, \ldots, n. \qquad (5)$$

Here $z_p$ is the regressed depth of the superpixel $p$, parametrized by the
CNN parameters $\boldsymbol{\theta}$.

The network architecture for the unary part is depicted in Fig. 3. Our CNN
model in Fig. 3 is mainly inspired by the well-known network architecture of
Krizhevsky et al. [15], with modifications. It is composed of 5
convolutional layers and 4 fully-connected layers. The input image is first
over-segmented into superpixels, then for each superpixel, we consider the
image patch centred around its centroid. Each of the image patches is
resized to 224 × 224 pixels (other resolutions also work) and then fed to
the convolutional neural network. Note that the convolutional and the
fully-connected layers are shared across all the image patches of different
superpixels. Rectified linear units (ReLU) are used as activation functions
for the five convolutional layers and the first two fully-connected layers.
For the third fully-connected layer, we use the logistic function
$f(x) = (1 + e^{-x})^{-1}$ as the activation function. The last
fully-connected layer plays the role of model ensemble, with no activation
function following it. The output is a 1-dimensional real-valued depth for a
single superpixel.


2.2.2 Pairwise potential

We construct the pairwise potential from K types of similarity observations,
each of which enforces smoothness by exploiting consistency information of
neighboring superpixels:

$$V(y_p, y_q, \mathbf{x}; \boldsymbol{\beta}) = \frac{1}{2} R_{pq} (y_p - y_q)^2, \quad \forall p, q = 1, \ldots, n. \qquad (6)$$

Here $R_{pq}$ is the output of the network in the pairwise part (see Fig. 2)
from a neighboring superpixel pair (p, q). We use a fully-connected layer
here:

$$R_{pq} = \boldsymbol{\beta}^\top [S_{pq}^{(1)}, \ldots, S_{pq}^{(K)}]^\top = \sum_{k=1}^{K} \beta_k S_{pq}^{(k)}, \qquad (7)$$

where $\mathbf{S}^{(k)}$ is the k-th similarity matrix whose elements are
$S_{pq}^{(k)}$ ($\mathbf{S}^{(k)}$ is symmetric);
$\boldsymbol{\beta} = [\beta_1, \ldots, \beta_K]^\top$ are the network
parameters. From Eq. (7), we can see that we do not use any activation
function. However, as our framework is general, more complicated networks
may be seamlessly incorporated for the pairwise part. In Sec. 2.3 we will
show that we can derive a general form for calculating the gradients with
respect to $\boldsymbol{\beta}$ (see Eq. (19)). To ensure that
$Z(\mathbf{x})$ in Eq. (2) is integrable, we require $\beta_k \geq 0$ as in
[11]. Note that this is a sufficient but not necessary condition.

Here we consider 3 types of pairwise similarities, measured by the colour
difference, colour histogram difference and texture disparity in terms of
local binary patterns (LBP) [33], which take the conventional form:

$$S_{pq}^{(k)} = e^{-\gamma \| s_p^{(k)} - s_q^{(k)} \|}, \quad k = 1, 2, 3,$$

where $s_p^{(k)}$, $s_q^{(k)}$ are the observation values of the superpixels
p, q calculated from colour, colour histogram and LBP; $\|\cdot\|$ denotes
the $\ell_2$ norm of a vector and $\gamma$ is a constant. It may be possible
to learn features for the pairwise term too. For example, the pairwise term
can be a deep CNN with raw pixels as the input. A more sophisticated
pairwise energy may further improve the estimation, especially for complex
discrete labelling problems, at the price of increased computational
complexity. For depth estimation, we find that our current pairwise energy
already works very well.

2.3 Learning

With the unary and the pairwise potentials defined in Eqs. (5) and (6), we
can now write the energy function as:

$$E(\mathbf{y}, \mathbf{x}) = \sum_{p \in \mathcal{N}} (y_p - z_p)^2 + \sum_{(p,q) \in \mathcal{S}} \frac{1}{2} R_{pq} (y_p - y_q)^2. \qquad (8)$$

For ease of expression, we introduce the following notation:

$$\mathbf{A} = \mathbf{I} + \mathbf{D} - \mathbf{R}, \qquad (9)$$

where $\mathbf{I}$ is the n × n identity matrix; $\mathbf{R}$ is the
affinity matrix composed of $R_{pq}$; $\mathbf{D}$ is a diagonal matrix with
$D_{pp} = \sum_q R_{pq}$. We see that $\mathbf{D} - \mathbf{R}$ is the graph
Laplacian matrix. Thus $\mathbf{A}$ is the regularized Laplacian matrix.
Expanding Eq. (8), we have:

$$E(\mathbf{y}, \mathbf{x}) = \mathbf{y}^\top \mathbf{A} \mathbf{y} - 2 \mathbf{z}^\top \mathbf{y} + \mathbf{z}^\top \mathbf{z}. \qquad (10)$$
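The step from the sum form in Eq. (8) to the matrix form in Eq. (10) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names, the chain-graph adjacency and the random features are illustrative assumptions. It also assumes that the pairwise sum in Eq. (8) runs over ordered neighbouring pairs, i.e. both (p, q) and (q, p), which is the convention under which the sum equals y^T (D - R) y.

```python
import numpy as np

def pairwise_similarity(feats, gamma=1.0):
    """S[p, q] = exp(-gamma * ||feats[p] - feats[q]||_2) for all superpixel pairs."""
    diff = feats[:, None, :] - feats[None, :, :]
    return np.exp(-gamma * np.linalg.norm(diff, axis=-1))

def build_A(S_list, beta, adjacency):
    """Return (A, R) with R = sum_k beta_k S^(k) on neighbouring pairs, A = I + D - R."""
    R = sum(b * S for b, S in zip(beta, S_list)) * adjacency
    D = np.diag(R.sum(axis=1))
    return np.eye(R.shape[0]) + D - R, R

def energy_sum(y, z, R):
    """Eq. (8): unary squared errors plus pairwise terms over ordered pairs (p, q)."""
    unary = np.sum((y - z) ** 2)
    pairwise = 0.5 * np.sum(R * (y[:, None] - y[None, :]) ** 2)
    return unary + pairwise

def energy_quadratic(y, z, A):
    """Eq. (10): y^T A y - 2 z^T y + z^T z."""
    return y @ A @ y - 2 * z @ y + z @ z

# Toy example: 5 superpixels on a chain graph, K = 3 similarity types (hypothetical data).
rng = np.random.default_rng(0)
n = 5
adjacency = np.zeros((n, n))
for i in range(n - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
S_list = [pairwise_similarity(rng.normal(size=(n, 3))) for _ in range(3)]
beta = np.array([0.5, 1.0, 0.2])          # non-negative, as the model requires
A, R = build_A(S_list, beta, adjacency)
y, z = rng.normal(size=n), rng.normal(size=n)
print(energy_sum(y, z, R), energy_quadratic(y, z, A))  # the two values should agree
```

Because the graph Laplacian D - R is positive semidefinite for non-negative weights, the eigenvalues of A in this sketch are all at least 1, which is what makes the Gaussian integral in the partition function well defined.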

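Due to the quadratic terms of $\mathbf{y}$ in the energy function in
Eq. (10) and the positive definiteness of $\mathbf{A}$ (with all positive
edges of the graph, the Laplacian matrix must be positive semidefinite and
therefore $\mathbf{A}$ must be positive definite), we can analytically
calculate the integral in the partition function (Eq. (2)) as:

$$\begin{aligned}
Z(\mathbf{x}) &= \int_{\mathbf{y}} \exp\{-E(\mathbf{y}, \mathbf{x})\}\, \mathrm{d}\mathbf{y} \\
&= \exp\{-\mathbf{z}^\top \mathbf{z}\} \int_{\mathbf{y}} \exp\{-\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y}\}\, \mathrm{d}\mathbf{y} \\
&= \frac{\pi^{n/2}}{|\mathbf{A}|^{1/2}} \exp\{\mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z} - \mathbf{z}^\top \mathbf{z}\}.
\end{aligned} \qquad (11)$$

From Eqs. (1), (10) and (11), we can now write the probability distribution
function as:

$$\begin{aligned}
\Pr(\mathbf{y}|\mathbf{x}) &= \frac{1}{Z(\mathbf{x})} \exp(-E(\mathbf{y}, \mathbf{x})) \\
&= \frac{\exp\{-\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y} - \mathbf{z}^\top \mathbf{z}\}}{\frac{\pi^{n/2}}{|\mathbf{A}|^{1/2}} \exp\{\mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z} - \mathbf{z}^\top \mathbf{z}\}} \\
&= \frac{|\mathbf{A}|^{1/2}}{\pi^{n/2}} \exp\{-\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y} - \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}\},
\end{aligned} \qquad (12)$$

where $\mathbf{z} = [z_1, \ldots, z_n]^\top$; $|\cdot|$ denotes the
determinant of a matrix, and $\mathbf{A}^{-1}$ the inverse of $\mathbf{A}$.
Then the negative log-likelihood can be written as:

$$-\log \Pr(\mathbf{y}|\mathbf{x}) = \mathbf{y}^\top \mathbf{A} \mathbf{y} - 2 \mathbf{z}^\top \mathbf{y} + \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z} - \frac{1}{2} \log(|\mathbf{A}|) + \frac{n}{2} \log(\pi). \qquad (13)$$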
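As a sanity check on the closed forms in Eqs. (11) and (13), the sketch below evaluates log Z analytically and compares it with a brute-force grid sum for n = 2. This is an illustrative NumPy sketch under stated assumptions: the function names, the 2 × 2 matrix A and the grid settings are made up for the example and do not come from the paper.

```python
import numpy as np

def log_partition(A, z):
    """Closed-form log Z(x) from Eq. (11)."""
    n = z.shape[0]
    sign, logdet = np.linalg.slogdet(A)
    if sign <= 0:
        raise ValueError("A must be positive definite")
    return 0.5 * n * np.log(np.pi) - 0.5 * logdet + z @ np.linalg.solve(A, z) - z @ z

def energy(y, z, A):
    """Energy in matrix form, Eq. (10)."""
    return y @ A @ y - 2 * z @ y + z @ z

def neg_log_likelihood(y, z, A):
    """Eq. (13), written as E(y, x) + log Z(x)."""
    return energy(y, z, A) + log_partition(A, z)

def log_partition_grid_2d(A, z, lim=12.0, steps=1201):
    """Brute-force log Z for n = 2: sum exp(-E) over a regular grid."""
    g = np.linspace(-lim, lim, steps)
    Y = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)  # (steps, steps, 2)
    E = np.einsum("...i,ij,...j->...", Y, A, Y) - 2 * Y @ z + z @ z
    h = g[1] - g[0]
    return np.log(np.exp(-E).sum() * h * h)

# Hypothetical two-superpixel example: A = I + D - R with a single edge weight 0.5.
A = np.array([[1.5, -0.5], [-0.5, 1.5]])
z = np.array([1.0, 2.0])
print(log_partition(A, z), log_partition_grid_2d(A, z))  # should agree closely
```

Using `slogdet` and `solve` rather than `det` and `inv` is a numerical-stability choice of this sketch; the formulas themselves are those of Eqs. (11) and (13).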

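During learning, we minimize the negative conditional log-likelihood of the
training data. Adding regularization to $\boldsymbol{\theta}$,
$\boldsymbol{\beta}$, we then arrive at the final optimization:

$$\min_{\boldsymbol{\theta},\, \boldsymbol{\beta} \geq 0} \; -\sum_{i=1}^{N} \log \Pr(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}; \boldsymbol{\theta}, \boldsymbol{\beta}) + \frac{\lambda_1}{2} \|\boldsymbol{\theta}\|_2^2 + \frac{\lambda_2}{2} \|\boldsymbol{\beta}\|_2^2, \qquad (14)$$

where $\mathbf{x}^{(i)}$, $\mathbf{y}^{(i)}$ denote the i-th training image
and the corresponding depth map; N is the number of training images;
$\lambda_1$ and $\lambda_2$ are weight decay parameters.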


2.3.1 Optimization 

We use stochastic gradient descent (SGD) based back propagation to solve the
optimization problem in Eq. (14) for learning all parameters of the whole
network. We project the solutions to the feasible set when the bound
constraint $\beta_k \geq 0$ is violated. In the following, we calculate the
partial derivatives of $-\log \Pr(\mathbf{y}|\mathbf{x})$ with respect to
the network parameters $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$.
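The projection onto the feasible set mentioned above amounts to clipping negative entries of the pairwise parameters to zero after each update. A minimal sketch, assuming a plain SGD step with an optional weight decay term (the function name, learning rate and decay argument are illustrative, not taken from the paper):

```python
import numpy as np

def projected_sgd_step(beta, grad_beta, lr, weight_decay=0.0):
    """One SGD step on beta, then projection onto the feasible set {beta_k >= 0}."""
    beta = beta - lr * (grad_beta + weight_decay * beta)
    return np.maximum(beta, 0.0)  # Euclidean projection onto the non-negative orthant

# Hypothetical values: the first component would become negative and is clipped.
beta_new = projected_sgd_step(np.array([0.1, 1.0]), np.array([1.0, -1.0]), lr=0.5)
print(beta_new)
```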

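For the unary part, here we calculate the partial derivatives of
$-\log \Pr(\mathbf{y}|\mathbf{x})$ with respect to $\theta_l$ (one element
of $\boldsymbol{\theta}$). Recalling that
$\mathbf{A} = \mathbf{I} + \mathbf{D} - \mathbf{R}$ (Eq. (9)),
$\mathbf{A}^\top = \mathbf{A}$, $(\mathbf{A}^{-1})^\top = \mathbf{A}^{-1}$
and $|\mathbf{A}^{-1}| = 1/|\mathbf{A}|$, we have:

$$\begin{aligned}
\frac{\partial\{-\log \Pr(\mathbf{y}|\mathbf{x})\}}{\partial \theta_l}
&= \frac{\partial\{-2 \mathbf{z}^\top \mathbf{y} + \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}\}}{\partial \theta_l} \qquad (15) \\
&= 2 (\mathbf{A}^{-1} \mathbf{z} - \mathbf{y})^\top \frac{\partial \mathbf{z}}{\partial \theta_l}. \qquad (16)
\end{aligned}$$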
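Eq. (16) says that the CRF loss layer passes the vector 2 (A^{-1} z - y) back to the unary network, where it is chained with the derivative of z with respect to each CNN parameter. The sketch below checks this against central finite differences. It is an illustration only: the helper names are assumptions, and the "network" is replaced by a hypothetical linear map z = W theta so that the derivative of z is simply W.

```python
import numpy as np

def nll_z_terms(y, z, A):
    """The z-dependent part of Eq. (13): -2 z^T y + z^T A^{-1} z."""
    return -2 * z @ y + z @ np.linalg.solve(A, z)

def grad_wrt_z(y, z, A):
    """2 (A^{-1} z - y), the factor multiplying dz/dtheta_l in Eq. (16)."""
    return 2.0 * (np.linalg.solve(A, z) - y)

def backprop_to_theta(y, z, A, dz_dtheta):
    """Chain rule of Eq. (16); dz_dtheta has shape (n, n_params)."""
    return grad_wrt_z(y, z, A) @ dz_dtheta

def finite_difference(f, v, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

# Toy setup: symmetric positive definite A and a stand-in linear "network" z = W @ theta.
rng = np.random.default_rng(1)
n, n_params = 4, 3
M = rng.random((n, n))
R = 0.5 * (M + M.T)
np.fill_diagonal(R, 0.0)
A = np.eye(n) + np.diag(R.sum(axis=1)) - R
W = rng.normal(size=(n, n_params))
theta = rng.normal(size=n_params)
y = rng.normal(size=n)
analytic = backprop_to_theta(y, W @ theta, A, W)
numeric = finite_difference(lambda t: nll_z_terms(y, W @ t, A), theta)
print(np.max(np.abs(analytic - numeric)))
```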


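For the pairwise part, we calculate the partial derivatives of
$-\log \Pr(\mathbf{y}|\mathbf{x})$ with respect to $\beta_k$ as:

$$\begin{aligned}
\frac{\partial\{-\log \Pr(\mathbf{y}|\mathbf{x})\}}{\partial \beta_k}
&= \frac{\partial\{\mathbf{y}^\top \mathbf{A} \mathbf{y} + \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z} - \frac{1}{2} \log(|\mathbf{A}|)\}}{\partial \beta_k} \\
&= \mathbf{y}^\top \frac{\partial \mathbf{A}}{\partial \beta_k} \mathbf{y} - \mathbf{z}^\top \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \beta_k} \mathbf{A}^{-1} \mathbf{z} - \frac{1}{2} \mathrm{Tr}\Big(\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \beta_k}\Big),
\end{aligned} \qquad (17)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. We here introduce
matrix $\mathbf{J}$ to denote $\partial \mathbf{A} / \partial \beta_k$. Each
element of $\mathbf{J}$ is:

$$\begin{aligned}
J_{pq} &= \frac{\partial A_{pq}}{\partial \beta_k} = \frac{\partial\{D_{pq} - R_{pq}\}}{\partial \beta_k} \\
&= -\frac{\partial R_{pq}}{\partial \beta_k} + \delta(p = q) \sum_{q'} \frac{\partial R_{pq'}}{\partial \beta_k},
\end{aligned} \qquad (18)$$

where $\delta(\cdot)$ is the indicator function, which equals 1 if p = q is
true and 0 otherwise. According to Eq. (17) and the definition of
$\mathbf{J}$ in Eq. (18), we can now write the partial derivative of
$-\log \Pr(\mathbf{y}|\mathbf{x})$ with respect to $\beta_k$ as:

$$\frac{\partial\{-\log \Pr(\mathbf{y}|\mathbf{x})\}}{\partial \beta_k} = \mathbf{y}^\top \mathbf{J} \mathbf{y} - \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{J} \mathbf{A}^{-1} \mathbf{z} - \frac{1}{2} \mathrm{Tr}(\mathbf{A}^{-1} \mathbf{J}). \qquad (19)$$

From Eqs. (19), (18), we can see that our framework is general and more
complicated networks for the pairwise part can be seamlessly incorporated.
Here, in our case, with the definition of $R_{pq}$ in Eq. (7), we have
$\partial R_{pq} / \partial \beta_k = S_{pq}^{(k)}$.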
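The gradient with respect to the pairwise parameters can be verified in the same way. The sketch below builds J for each similarity type from Eq. (18), using the fact that for the linear pairwise layer the derivative of R_pq with respect to beta_k is S^(k)_pq, evaluates Eq. (19), and compares the result with central finite differences of Eq. (13). Function names and the random symmetric similarity matrices are illustrative assumptions; the explicit matrix inverse is used only because the toy problem is tiny.

```python
import numpy as np

def build_A(S_list, beta):
    """A = I + D - R with R = sum_k beta_k S^(k) and D_pp = sum_q R_pq (Eqs. (7), (9))."""
    R = sum(b * S for b, S in zip(beta, S_list))
    return np.eye(R.shape[0]) + np.diag(R.sum(axis=1)) - R

def nll(y, z, A):
    """Negative log-likelihood of Eq. (13)."""
    n = y.shape[0]
    _, logdet = np.linalg.slogdet(A)
    return (y @ A @ y - 2 * z @ y + z @ np.linalg.solve(A, z)
            - 0.5 * logdet + 0.5 * n * np.log(np.pi))

def grad_nll_wrt_beta(y, z, S_list, beta):
    """Eq. (19) for every k, with J = dA/dbeta_k built from Eq. (18)."""
    A = build_A(S_list, beta)
    A_inv = np.linalg.inv(A)   # acceptable for a small demo
    u = A_inv @ z              # A^{-1} z
    grads = []
    for S in S_list:
        J = np.diag(S.sum(axis=1)) - S   # dR_pq/dbeta_k = S^(k)_pq
        grads.append(y @ J @ y - u @ J @ u - 0.5 * np.trace(A_inv @ J))
    return np.array(grads)

def finite_difference_beta(y, z, S_list, beta, eps=1e-6):
    """Central finite differences of the NLL with respect to each beta_k."""
    g = np.zeros_like(beta)
    for k in range(beta.size):
        bp, bm = beta.copy(), beta.copy()
        bp[k] += eps
        bm[k] -= eps
        g[k] = (nll(y, z, build_A(S_list, bp)) - nll(y, z, build_A(S_list, bm))) / (2 * eps)
    return g

# Toy data: K = 3 symmetric, non-negative similarity matrices with zero diagonal.
rng = np.random.default_rng(2)
n = 6
S_list = []
for _ in range(3):
    M = rng.random((n, n))
    S = 0.5 * (M + M.T)
    np.fill_diagonal(S, 0.0)
    S_list.append(S)
beta = np.array([0.3, 0.7, 1.1])
y, z = rng.normal(size=n), rng.normal(size=n)
print(grad_nll_wrt_beta(y, z, S_list, beta))
print(finite_difference_beta(y, z, S_list, beta))
```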

2.3.2 Depth prediction 

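Predicting the depths of a new image is to solve the MAP inference in
Eq. (3), which writes as:

$$\begin{aligned}
\mathbf{y}^\star &= \operatorname*{argmax}_{\mathbf{y}} \log \Pr(\mathbf{y}|\mathbf{x}) \\
&= \operatorname*{argmax}_{\mathbf{y}} \; -\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y}.
\end{aligned} \qquad (20)$$

With the definition of $\mathbf{A}$ in Eq. (9), $\mathbf{A}$ is symmetric.
Then by setting the partial derivative of
$-\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y}$ with
respect to $\mathbf{y}$ to $\mathbf{0}$, we have

$$\begin{aligned}
&\frac{\partial\{-\mathbf{y}^\top \mathbf{A} \mathbf{y} + 2 \mathbf{z}^\top \mathbf{y}\}}{\partial \mathbf{y}} = \mathbf{0} \\
\Rightarrow\; & -(\mathbf{A} + \mathbf{A}^\top) \mathbf{y} + 2 \mathbf{z} = \mathbf{0} \\
\Rightarrow\; & \mathbf{y} = \mathbf{A}^{-1} \mathbf{z}.
\end{aligned} \qquad (21)$$

It shows that closed-form solutions exist for the MAP inference in Eq. (20):

$$\mathbf{y}^\star = \mathbf{A}^{-1} \mathbf{z}. \qquad (22)$$

If we discard the pairwise terms, namely $R_{pq} = 0$, then Eq. (22)
degenerates to $\mathbf{y}^\star = \mathbf{z}$, which is a plain CNN
regression model (we will report the results of this method as a baseline in
the experiments). Note that we do not need to explicitly compute the matrix
inverse $\mathbf{A}^{-1}$, which can be expensive (cubic in the number of
nodes). Instead we can obtain the value of $\mathbf{A}^{-1} \mathbf{z}$ by
solving a linear system.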
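A minimal sketch of this prediction step, assuming NumPy's dense solver (the function name and toy matrices are illustrative; for large superpixel graphs a sparse or Cholesky-based solver would be a natural substitute, which is a suggestion of this sketch rather than something stated in the paper):

```python
import numpy as np

def predict_depth(A, z):
    """MAP estimate y* = A^{-1} z (Eq. (22)) via a linear solve, without forming A^{-1}."""
    return np.linalg.solve(A, z)

# Toy example with 3 superpixels on a chain; R holds hypothetical edge weights.
R = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.3],
              [0.0, 0.3, 0.0]])
A = np.eye(3) + np.diag(R.sum(axis=1)) - R
z = np.array([1.0, 2.0, 4.0])            # unary (CNN) depth predictions
y_star = predict_depth(A, z)
print(y_star)                            # pairwise-smoothed depths
print(predict_depth(np.eye(3), z))       # with R = 0 the prediction reduces to z
```

In this sketch the strongly connected first two superpixels are pulled towards each other, while setting R to zero reproduces the plain regression baseline y* = z.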

2.4 Speeding up training using fully convolutional 
networks and superpixel pooling 

Thus far, we have presented our DCNF model for depth estimation based on
image superpixels. From Fig. 2 we can see that for constructing the unary
potentials, we are essentially performing patch-wise convolutions (similar
operations are performed in the R-CNN [21]). A major concern of the proposed
method is its computational efficiency and memory consumption, since we need
to perform convolutions over hundreds or even thousands (the number of
superpixels) of image patches for a single input image. Many of those
convolutions are redundant due to significant image patch overlaps.

Fig. 4 - An overview of the unary part of the DCNF-FCSP model (panels, from
left to right: input image x, convolution maps, unary output z). For the
unary part, the input image is fed into a fully-convolutional network to
produce convolution maps (d is the number of filters of the last
fully-convolutional layer). The obtained convolution maps, together with the
superpixel segmentation over the original input image, are fed to a
superpixel pooling layer. The outputs are d-dimensional feature vectors for
each of the n superpixels, which are then followed by 3 fully-connected
layers to produce the unary output z. The pairwise part is omitted here
since we use the same network architecture as in the DCNF model (Fig. 2).
The unary output z and the pairwise output R are used as input to the CRF
loss layer, which minimizes the negative log-likelihood (see Sec. 2.4 for
details).

Fig. 5 - The fully convolutional network architecture used in Fig. 4. The
network takes input images of arbitrary size and outputs convolution maps.

Naturally, a promising direction for reducing the computation burden is to
perform convolutions over the entire image once, and then obtain
convolutional features for each superpixel. However, to find the
convolutional features of the image superpixels from the obtained
convolution maps, one needs to establish associations between the two.
Therefore, we here propose an improved model, which we term DCNF-FCSP (DCNF
with Fully Convolutional networks and Superpixel Pooling), based on fully
convolutional networks and a novel superpixel pooling method, to address
this issue. As we will show in Sec. 3.2, this new model significantly speeds
up the training and prediction while producing almost the same prediction
accuracy. Most importantly, with this more efficient model, we are able to
design deeper networks to achieve better performance.


2.4.1 DCNF-FCSP overview

In comparison to the DCNF model, our new DCNF-FCSP model mainly improves the
unary part while keeping the same pairwise network architecture as in
Fig. 2. We show the model architecture of the unary part in Fig. 4.
Specifically, the input image is fed to a fully convolutional network
(introduced in the sequel). The outputs are convolutional feature maps of
size h × w × d (d is the dimension of the obtained convolutional feature
vector, i.e., the number of channels of the last convolutional layer). The
size of the convolution maps, h × w, is typically smaller than the input
image size. Each convolutional feature vector in the output convolution maps
corresponds to a patch in the input image. We propose a novel superpixel
pooling method (described in the following) to associate these outputs back
to the superpixels in the input image. Specifically, the convolutional
feature maps are used as inputs to a superpixel pooling layer to obtain n
superpixel feature vectors with d dimensions (n is the number of
superpixels). The n superpixel feature vectors are then fed to 3
fully-connected layers to produce the unary output z. We use the same
pairwise network architecture as depicted in Fig. 2, which we do not show
here. With the unary output z and the pairwise output R, we construct
potential functions according to Eqs. (5) and (6) and optimize the negative
log-likelihood.

Compared to the originally proposed DCNF method, this improved DCNF-FCSP
model only needs to perform convolutions over the entire image once, rather
than over hundreds of superpixel image patches. This significantly reduces
the computation and GPU memory burden, bringing around 10 times training
speedup while producing almost the same prediction accuracy, as we
demonstrate later in Sec. 3.2. We next introduce the fully convolutional
networks and the superpixel pooling method in detail.
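The superpixel pooling layer outlined above (nearest-neighbour upsampling of the convolution maps to the image size, followed by average pooling inside each superpixel, as in the caption of Fig. 6) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name, the integer label-map input and the particular index mapping used for nearest-neighbour upsampling are assumptions.

```python
import numpy as np

def superpixel_pool(conv_maps, labels, n_superpixels):
    """Average-pool upsampled convolution maps inside each superpixel.

    conv_maps: (h, w, d) feature maps; labels: (H, W) integer superpixel ids
    in [0, n_superpixels). Returns an (n_superpixels, d) feature matrix.
    """
    h, w, d = conv_maps.shape
    H, W = labels.shape
    # Nearest-neighbour upsampling: map every image pixel to one conv-map cell.
    rows = np.minimum(np.arange(H) * h // H, h - 1)
    cols = np.minimum(np.arange(W) * w // W, w - 1)
    upsampled = conv_maps[rows][:, cols]               # (H, W, d)
    flat_labels = labels.ravel()
    feats = np.zeros((n_superpixels, d))
    np.add.at(feats, flat_labels, upsampled.reshape(-1, d))  # per-superpixel sums
    counts = np.bincount(flat_labels, minlength=n_superpixels)
    return feats / np.maximum(counts, 1)[:, None]      # per-superpixel means

# Toy example: a 2 x 2 map with one channel, a 4 x 4 image split into two superpixels.
conv_maps = np.arange(4, dtype=float).reshape(2, 2, 1)
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1                                      # right half is superpixel 1
pooled = superpixel_pool(conv_maps, labels, n_superpixels=2)
print(pooled)
```

The resulting n × d matrix is what the 3 fully-connected layers of the DCNF-FCSP unary part would consume, one row per superpixel.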

2.4.2 Fully convolutional networks

Typical CNN models, including AlexNet [15], VGGNet [20], etc., are composed
of convolution layers and fully connected layers. They usually take standard
sized images as inputs, e.g., 224 × 224 pixels, to produce non-spatial
outputs. In contrast, a fully convolutional network can take arbitrarily
sized images as inputs and outputs convolutional spatial maps. It has
therefore been actively studied for dense prediction problems very recently.
We here exploit this new development trend in CNN to speed up the patch-wise
convolutions in the DCNF model. We illustrate the fully convolutional
network architecture that we use in Fig. 5. As shown, the network is
composed of 7 convolution layers, with the first 5 layers transferred from
the AlexNet [15]. We then add 2 more convolution layers with 3 × 3 filter
size and 512 channels each. The network takes as input images of arbitrary
size and outputs convolution maps with d = 512 channels. Note that we can
design deeper networks here for pursuing better performance. We will
demonstrate in Sec. 3.3 the benefits of using deeper models.

Fig. 6 - An illustration of the superpixel pooling method, which mainly
consists of convolution map upsampling and superpixel pooling. The
convolution maps are upsampled to the original image size by nearest
neighbor interpolation, over which the superpixel masking is applied. Then
average pooling is performed within each superpixel region to produce the n
convolution features. n is the number of superpixels in the image; d is the
number of channels of the convolution maps.


2.4.3 Superpixel pooling 

After the input images go through the fully convolutional networks, we
acquire convolution maps. To obtain superpixel features, we need to
associate these convolutional feature maps back to the image superpixels.
We therefore propose a novel superpixel pooling method. An illustration
of this method is shown in Fig. 6, which mainly consists of convolution
map upsampling and superpixel average pooling. In general, this
superpixel pooling layer takes the convolution maps as input and outputs
superpixel features. Specifically, the obtained convolution maps are
first upsampled to the original image size by nearest neighbor
interpolation. Note that other interpolation methods such as linear
interpolation may be applicable too. However, as discussed below, nearest
neighbor interpolation makes the implementation much easier and the
computation faster.

Then the superpixel masking is applied and average pooling is performed
within each superpixel region. The outputs are pooled superpixel
features, which are used for constructing the unary potentials. In the
sequel, we describe this method in detail.

In practice there is no need to explicitly upsample the convolution maps.
Instead, we count the frequencies of the convolutional feature vectors
that fall into each superpixel region. We denote the convolution maps as
$C \in \mathbb{R}^{h \times w \times d}$, with each element being
$C_{ijk}$ ($i = 1, \dots, h$; $j = 1, \dots, w$; $k = 1, \dots, d$). We
represent the t-th superpixel feature as a d-dimensional column vector
$\mathbf{h}_t$ ($t = 1, \dots, n$), with elements $h_{tk}$
($k = 1, \dots, d$). $W_t \in \mathbb{R}^{h \times w}$ is a frequency
weighting matrix associated with the t-th superpixel, with elements
$W_{ijt}$. $W_{ijt}$ represents the weight of the (i, j)-th feature
vector in the convolution maps that is associated with the t-th
superpixel. To calculate $W_{ijt}$, we simply count the occurrences of
the (i, j)-th convolutional feature vector that appear in the t-th
superpixel region, and apply $\ell_1$ normalization to each $W_t$. By
constructing this frequency matrix $W_t$, we avoid the explicit
upsampling operation. The superpixel pooling can then be represented as:

$$h_{tk} = \sum_{(i,j) \in \mathcal{N}_t} W_{ijt} \, C_{ijk}, \tag{23}$$

where $(i, j) \in \mathcal{N}_t$ denotes the (i, j)-th convolutional
feature vector in the convolution maps being associated with the t-th
superpixel.

During the network forward pass, the superpixel pooling layer performs
the linear transformation in Eq. (23) to output $\mathbf{h}_t$ from the
input $C$. For the network backward pass, the gradients can be easily
calculated since we have:

$$\frac{\partial h_{tk}}{\partial C_{ijk}} =
\begin{cases}
W_{ijt} & \text{if } (i, j) \in \mathcal{N}_t, \\
0 & \text{otherwise.}
\end{cases} \tag{24}$$
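As a rough illustration of Eqs. (23) and (24), the following NumPy sketch builds the frequency weights and applies the pooling without explicit upsampling. It is not the authors' MatConvNet implementation: the function names, the dense (n, h, w) weight layout and the particular nearest neighbor index mapping are assumptions made for this example.

```python
import numpy as np

def superpixel_pooling_weights(labels, h, w, n):
    """Return weights of shape (n, h, w).

    weights[t, i, j] is the l1-normalised number of image pixels in
    superpixel t whose nearest-neighbour source is map cell (i, j).
    `labels` is an (H, W) integer array of superpixel ids in [0, n).
    """
    img_h, img_w = labels.shape
    # Assumed nearest-neighbour convention: pixel row r maps to r * h // H.
    rows = np.arange(img_h) * h // img_h
    cols = np.arange(img_w) * w // img_w
    ii, jj = np.meshgrid(rows, cols, indexing="ij")
    weights = np.zeros((n, h, w))
    # Count how often each map cell occurs inside each superpixel.
    np.add.at(weights, (labels.ravel(), ii.ravel(), jj.ravel()), 1.0)
    sums = weights.sum(axis=(1, 2), keepdims=True)
    return weights / np.maximum(sums, 1.0)  # empty superpixels stay all-zero

def pool_forward(C, weights):
    # Eq. (23): h_tk = sum_{i,j} W_ijt * C_ijk
    return np.einsum("tij,ijk->tk", weights, C)

def pool_backward(grad_h, weights):
    # Eq. (24): dh_tk / dC_ijk = W_ijt, so dL/dC_ijk = sum_t W_ijt * dL/dh_tk
    return np.einsum("tij,tk->ijk", weights, grad_h)
```

Because the layer is linear in C, the backward pass is just the transpose of the forward map, which is why the gradient in Eq. (24) needs no extra bookkeeping beyond the weights themselves.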

Thus far, we have successfully established the associations between the
convolutional feature maps and the image superpixels. Simple as it is,
the proposed superpixel pooling method jointly exploits the benefits of
fully convolutional networks and superpixels. It provides an efficient
yet equally effective alternative to the patch-wise convolutions used in
the DCNF model, as we demonstrate in Sec. 3.2.

2.5 Implementation details 

We implement the network training based on the CNN toolbox VLFeat
MatConvNet¹ with our own modifications. Training is done on a standard
desktop with an NVIDIA GTX 780 GPU with 6 GB memory. The DCNF model, with
the network design in Fig. 3, has approximately 40 million parameters.
The DCNF-FCSP model, with the network design in Fig. 5, has around 5.8
million parameters. With the very deep network architecture, the
DCNF-FCSP model has around 20 million parameters. The large number of
parameters in the DCNF model comes mainly from the 4096-dimensional fully
connected layer in Fig. 3, which is removed in the DCNF-FCSP model. We
next present the implementation details of the two proposed models.

DCNF. We initialize the first 6 layers of the unary part in Fig. 3 using
a CNN model trained on ImageNet from [34]. First, we do not back
propagate through these first 6 layers, keeping them fixed, and train the
rest of the network (we refer to this process as pre-training) with the
following settings: momentum is set to 0.9, and the weight decay
parameters $\lambda_1$, $\lambda_2$ are set to 0.0005. During
pre-training, the learning rate is initialized at 0.0001 and decreased by
40% every 20 epochs. We run 60 epochs to report the results of
pre-training (with the learning rate decreased twice). During
pre-training, it takes less than 0.1 s for one network forward pass to do
depth predictions. Then we train the whole network with the same momentum
and weight decay. Training the whole network takes around 16.5 hours on
the Make3D dataset, and around 33 hours on the NYU v2 dataset. For this
fine-tuned model, it takes ~1.1 s for one network forward pass to do
depth predictions.
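The pre-training learning rate schedule described above can be written down in a few lines. This is only a sketch under two assumptions that the text does not state explicitly: "decreased by 40%" is read as multiplying the rate by 0.6, and epochs are counted from 0 so that the drops fall at epochs 20 and 40 (two decreases within 60 epochs). The function name is hypothetical.

```python
def pretrain_learning_rate(epoch, base_lr=1e-4, drop=0.4, step=20):
    """Learning rate for a 0-indexed epoch, reduced by `drop` every `step` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // step)

# Print the rate at the start and end of each 20-epoch stage.
for epoch in (0, 19, 20, 39, 40, 59):
    print(epoch, pretrain_learning_rate(epoch))
```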

DCNF-FCSP. We initialize the first 5 layers in Fig. 5 with the same model
trained on ImageNet from [34]. The momentum and weight decay parameters
are set the same as in the DCNF model. We also use the same training
protocol as in the DCNF model, i.e., first pre-train and then fine-tune
the whole model.

1. VLFeat MatConvNet: http://www.vlfeat.org/matconvnet/

Fig. 11 - Comparison of the whole model training time (network forward + backward) in seconds (in log scale) for one image
on the NYU v2 dataset with respect to different numbers of superpixels per image (horizontal axis: number of superpixels per image).
The DCNF-FCSP model is orders of magnitude faster than the DCNF model.

Fig. 12 - Comparison of the network forward time of the whole model during depth prediction (in seconds) for one image on the
NYU v2 dataset with respect to different numbers of superpixels per image. The DCNF-FCSP model is significantly faster than the
DCNF model.

3 Experiments

We organize our experiments into the following three parts: 1) We compare
our DCNF model with several baseline methods to show the benefits of
jointly learning CNN and CRF; 2) We perform comparisons between the two
proposed models, i.e., DCNF and DCNF-FCSP, to show that the DCNF-FCSP
model is equally effective while generally being ~10 times faster; 3) We
compare our DCNF-FCSP model using a deeper network design with
state-of-the-art methods to show that our model performs significantly
better. We evaluate on three popular datasets which are available online:
the indoor NYU v2 Kinect dataset [35], the outdoor Make3D range image
dataset [23] and the KITTI dataset [36]. Several measures commonly used
in prior works are applied here for quantitative evaluations:

• average relative error (rel): $\frac{1}{T} \sum_p \frac{|d_p^{gt} - d_p|}{d_p^{gt}}$;

• root mean squared error (rms): $\sqrt{\frac{1}{T} \sum_p (d_p^{gt} - d_p)^2}$;

• average log10 error (log10): $\frac{1}{T} \sum_p |\log_{10} d_p^{gt} - \log_{10} d_p|$;

• accuracy with threshold thr: percentage (%) of $d_p$ s.t. $\max\left(\frac{d_p^{gt}}{d_p}, \frac{d_p}{d_p^{gt}}\right) = \delta < thr$;

where $d_p^{gt}$ and $d_p$ are the ground-truth and predicted depths
respectively at the pixel indexed by p, and T is the total number of
pixels in all the evaluated images.
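The four measures above translate directly into NumPy. The sketch below assumes that ground-truth and predicted depths are given as arrays of strictly positive values in the same units; the function name and the returned dictionary layout are illustrative, and the three thresholds are the ones used in the result tables.

```python
import numpy as np

def depth_metrics(gt, pred, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Compute rel, log10, rms and threshold accuracies over all pixels."""
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    rel = np.mean(np.abs(gt - pred) / gt)
    rms = np.sqrt(np.mean((gt - pred) ** 2))
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))
    # delta is the larger of the two depth ratios at each pixel.
    delta = np.maximum(gt / pred, pred / gt)
    acc = [float(np.mean(delta < thr)) for thr in thresholds]
    return {"rel": float(rel), "log10": float(log10), "rms": float(rms), "acc": acc}
```

Note that the accuracies come out as fractions in [0, 1], matching how they are listed in the tables, rather than as percentages.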
Fig. 7 - Examples of qualitative comparisons on the NYUD2 dataset (Best viewed on screen). Panels: test image, ground-truth,
Eigen et al. [3], DCNF-FCSP. Color indicates depths (red is far, blue is close). Our method yields visually better predictions
with sharper transitions, aligning to local details.


We use SLIC [37] to segment the images into a set of non-overlapping
superpixels. For the DCNF model, we consider the image patch within a
rectangular box centred on the centroid of each superpixel, which
contains a large portion of its background surroundings. More
specifically, we use a box size of 168 x 168 pixels for the NYU v2
dataset and 120 x 120 pixels for the Make3D dataset. Following [23], [9],
[3], we transform the depths into log-scale before training. For better
visualization, we apply a cross-bilateral filter for inpainting using the
provided toolbox [35] after obtaining the superpixel depth predictions.
Our experiments empirically show that this post-processing has negligible
impact on the evaluation performance.

Fig. 8 - Examples of depth predictions on the Make3D dataset (Best viewed on screen). Panels: test image, ground-truth, DCNF-FCSP.
Depths are shown in log scale and in color (red is far, blue is close).

3.1 Baseline comparisons

To demonstrate the effectiveness of the proposed method, we first conduct
experimental comparisons against several baseline methods:

• SVR: We train a support vector regressor using the CNN representations
from the first 6 layers of Fig. 3;

• SVR (smooth): We add a smoothness term to the trained SVR during
prediction by solving the inference problem in Eq. (22). As tuning
multiple pairwise parameters is not straightforward, we only use color
difference as the pairwise potential and choose the parameter $\beta$ by
hand-tuning on a validation set;

• Unary only: We replace the CRF loss layer in Fig. 2 with a least-square
regression layer (by setting the pairwise outputs $R_{pq} = 0$,
$p, q = 1, \dots, n$), which degenerates to a deep regression model
trained by SGD;

• Unary only (smooth): We add a similar smoothness term to our unary only
model, as done in the SVR (smooth) case.

TABLE 1 - Baseline comparisons on the NYU v2 dataset. Our method with the whole network training performs the best.

                       Error (lower is better)    Accuracy (higher is better)
Method                 rel     log10   rms        δ < 1.25   δ < 1.25^2   δ < 1.25^3
SVR                    0.313   0.128   1.068      0.490      0.787        0.921
SVR (smooth)           0.290   0.116   0.993      0.514      0.821        0.943
Unary only             0.295   0.117   0.985      0.516      0.815        0.938
Unary only (smooth)    0.287   0.112   0.956      0.535      0.828        0.943
DCNF (pre-train)       0.257   0.101   0.843      0.588      0.868        0.961
DCNF (fine-tune)       0.230   0.095   0.824      0.614      0.883        0.971
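The smoothness step used by the two "(smooth)" baselines, and the way the "Unary only" model falls out when R = 0, can be sketched as follows. The sketch assumes a quadratic energy of the form ||y - z||^2 + (1/2) sum_{p,q} R_pq (y_p - y_q)^2 with a symmetric, non-negative R, whose minimiser solves (I + D - R) y = z with D the diagonal matrix of row sums of R. The function names are hypothetical, and how R is built from color differences and $\beta$ is left out.

```python
import numpy as np

def smooth_depths(z, R):
    """Closed-form minimiser of the quadratic energy for unary outputs z."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    D = np.diag(R.sum(axis=1))
    A = np.eye(len(z)) + D - R
    return np.linalg.solve(A, z)

def energy(y, z, R):
    """||y - z||^2 + 0.5 * sum over ordered pairs (p, q) of R_pq (y_p - y_q)^2."""
    diff = y[:, None] - y[None, :]
    return np.sum((y - z) ** 2) + 0.5 * np.sum(R * diff ** 2)

z = np.array([1.0, 2.0, 4.0])
R = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
print(smooth_depths(z, np.zeros((3, 3))))  # R = 0: the unary outputs are returned unchanged
print(smooth_depths(z, R))                 # neighbouring depths are pulled together
```

With R = 0 the system matrix is the identity, so the prediction is simply z and training reduces to least-squares regression on the unary output, which is the "Unary only" setting.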


3.1.1 NYU v2 data

The NYU v2 dataset consists of 1449 RGBD images of indoor scenes, among
which 795 are used for training and 654 for testing (we use the standard
training/test split provided with the dataset). We report the baseline
comparisons in Table 1. From the table, several conclusions can be made:
1) When trained with only the unary term, a deeper network is beneficial
for better performance, which is demonstrated by the fact that our unary
only model outperforms the SVR model; 2) Adding a smoothness term to the
SVR or our unary only model helps improve the prediction accuracy; 3) Our
DCNF model achieves the best performance by jointly learning the unary
and the pairwise parameters in a unified deep CNN framework. Moreover,
fine-tuning the whole network yields a further performance gain. These
results demonstrate the efficacy of our model.

Fig. 9 - Examples of depth predictions on the KITTI dataset (Best viewed on screen). Panels: test image, ground-truth, DCNF-FCSP.
Depths are shown in log scale and in color (red is far, blue is close).

TABLE 2 - Baseline comparisons on the Make3D dataset. Our method with the whole network training performs the best.

                       Error (C1) (lower is better)    Error (C2) (lower is better)
Method                 rel     log10   rms             rel     log10   rms
SVR                    0.433   0.158   8.93            0.429   0.170   15.29
SVR (smooth)           0.380   0.140   8.12            0.384   0.155   15.10
Unary only             0.366   0.137   8.63            0.363   0.148   14.41
Unary only (smooth)    0.341   0.131   8.49            0.349   0.144   14.37
DCNF (pre-train)       0.331   0.127   8.82            0.324   0.134   13.29
DCNF (fine-tune)       0.314   0.119   8.60            0.307   0.125   12.89

TABLE 3 - Performance comparisons of DCNF and DCNF-FCSP on the NYU v2 dataset. The two models show comparable performance.

                         Error (lower is better)    Accuracy (higher is better)
Method                   rel     log10   rms        δ < 1.25   δ < 1.25^2   δ < 1.25^3
DCNF (pre-train)         0.257   0.101   0.843      0.588      0.868        0.961
DCNF (fine-tune)         0.230   0.095   0.824      0.614      0.883        0.971
DCNF-FCSP (pre-train)    0.261   0.100   0.842      0.583      0.869        0.964
DCNF-FCSP (fine-tune)    0.237   0.082   0.822      0.608      0.889        0.969

3.1.2 Make3D data

The Make3D dataset contains 534 images depicting outdoor scenes, with 400
for training and 134 for testing. As pointed out in [23], [24], this
dataset has limitations: the maximum depth value is 81 m, with far-away
objects all mapped to this one distance of 81 meters. As a remedy, two
criteria are used in [24] to report the prediction errors: (C1) Errors
are calculated only in the regions with ground-truth depth less than 70
meters; (C2) Errors are calculated over the entire image. We follow this
protocol to report the evaluation results in Table 2. As we can see, our
full DCNF model with the whole network training performs the best among
all the compared baseline methods. Using deeper networks and adding
smoothness terms generally help improve the performance.
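The two Make3D criteria differ only in which pixels enter the error averages. A minimal sketch, assuming depths in meters as flat or image-shaped arrays and reusing the rel, log10 and rms definitions from Sec. 3 (the function name and return layout are illustrative):

```python
import numpy as np

def make3d_errors(gt, pred, c1_max_depth=70.0):
    """Return rel, log10 and rms errors under criteria C1 and C2."""
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()

    def errors(mask):
        g, p = gt[mask], pred[mask]
        return {
            "rel": float(np.mean(np.abs(g - p) / g)),
            "log10": float(np.mean(np.abs(np.log10(g) - np.log10(p)))),
            "rms": float(np.sqrt(np.mean((g - p) ** 2))),
        }

    return {
        "C1": errors(gt < c1_max_depth),             # only ground truth below 70 m
        "C2": errors(np.ones(gt.shape, dtype=bool)),  # the entire image
    }
```

Pixels clamped to the 81 m maximum are excluded under C1 but counted under C2, which is why the two columns of the Make3D tables can move in different directions.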
Fig. 10 - An illustration of the absolute error maps and the pixel-wise error histograms of our predictions (Left: NYU v2; Right: Make3D).
The absolute error maps are shown in meters, with the color bar shown in the last row. For the error histogram plot, the horizontal axis 
shows the prediction error in meters (quantized into 20 bins), and the vertical axis shows the percentage of pixels in each bin. 


TABLE 4 - Performance comparisons of DCNF and DCNF-FCSP on the Make3D dataset. The two models perform on par in general.

                         Error (C1) (lower is better)    Error (C2) (lower is better)
Method                   rel     log10   rms             rel     log10   rms
DCNF (pre-train)         0.331   0.127   8.82            0.324   0.134   13.29
DCNF (fine-tune)         0.314   0.119   8.60            0.307   0.125   12.89
DCNF-FCSP (pre-train)    0.323   0.127   9.01            0.318   0.136   13.89
DCNF-FCSP (fine-tune)    0.312   0.113   9.10            0.305   0.120   13.24


3.2 DCNF vs. DCNF-FCSP 

In this section, we compare the performance of the proposed DCNF and
DCNF-FCSP models in terms of both prediction accuracy and computational
efficiency. The prediction performance is reported in Table 3 and
Table 4. We can see that the proposed DCNF-FCSP model performs very close
to the DCNF model. Next, we compare the computational efficiency of these
two models. Specifically, we report the training time (network forward +
backward) of one image for both whole models with respect to the number
of superpixels used per image. The comparison is conducted on the NYU v2
dataset and is shown in Fig. 11. As demonstrated, the DCNF-FCSP model is
generally orders of magnitude faster than the DCNF model. Moreover, the
speedup becomes more significant as the number of superpixels increases.
We also compare the network forward time of the whole models during depth
prediction, and plot the results in Fig. 12. The time shown is for
processing one image. We can see that the DCNF-FCSP model is much faster
as well as more scalable than the DCNF model. Most importantly, with this
more efficient DCNF-FCSP model, we can design deeper networks for better
performance, as we show in Sec. 3.3.


3.3 State-of-the-art comparisons

Recent studies have shown that very deep networks can significantly
improve image classification performance [39]. Thanks to the speedup
brought about by the superpixel pooling method, we are now able to design
deeper networks in our framework. We transfer the popular VGG-16 net
trained on ImageNet from [20]. Specifically, we replace the AlexNet part
(the first 5 convolutional layers in Fig. 5) with all the convolutional
layers (including the 5-th pooling layer) in VGG-16. These layers are
followed by the 2 newly added convolutional layers with 512 channels
each, and then the 3 fully connected layers to construct the unary
potentials. We follow the same training protocol, i.e., first pre-train
the remaining layers by fixing the transferred layers (the VGG-16 net
part) and then fine-tune the whole network.

TABLE 5 - State-of-the-art comparisons on the NYU v2 dataset. Our method performs the best in most cases. Note that the results of
Eigen et al. [3] are obtained by using extra training data (in the millions in total) while ours are obtained using the standard training set.

                               Error (lower is better)    Accuracy (higher is better)
Method                         rel     log10   rms        δ < 1.25   δ < 1.25^2   δ < 1.25^3
Saxena et al. [23]             0.349   -       1.214      0.447      0.745        0.897
DepthTransfer [7]              0.35    0.131   1.2        -          -            -
Discrete-continuous CRF [24]   0.335   0.127   1.06       -          -            -
Ladicky et al.                 -       -       -          0.542      0.829        0.941
Eigen et al. [3]               0.215   -       0.907      0.611      0.887        0.971
DCNF-FCSP (pre-train)          0.234   0.095   0.842      0.604      0.885        0.973
DCNF-FCSP (fine-tune)          0.213   0.087   0.759      0.650      0.906        0.976

TABLE 6 - State-of-the-art comparisons on the Make3D dataset. Our method performs the best. Note that the C2 errors of the
Discrete-continuous CRF [24] are reported with an ad-hoc post-processing step (train a classifier to label sky pixels and set the
corresponding regions to the maximum depth).

                               Error (C1) (lower is better)    Error (C2) (lower is better)
Method                         rel     log10   rms             rel     log10   rms
Saxena et al. [23]             -       -       -               0.370   0.187   -
Semantic Labelling [9]         -       -       -               0.379   0.148   -
DepthTransfer [7]              0.355   0.127   9.20            0.361   0.148   15.10
Discrete-continuous CRF [24]   0.335   0.137   9.49            0.338   0.134   12.60
DCNF-FCSP (pre-train)          0.331   0.119   7.77            0.330   0.133   14.46
DCNF-FCSP (fine-tune)          0.287   0.109   7.36            0.287   0.122   14.09

TABLE 7 - State-of-the-art comparisons on the KITTI dataset. Our method achieves the best RMS error. Note that the results of
Eigen et al. [3] are obtained by using extra training data (in the millions in total) while ours are obtained using 700 training images.
The results of Saxena et al. [23] are reproduced from [3].

                          Error (lower is better)    Accuracy (higher is better)
Method                    rel     log10   rms        δ < 1.25   δ < 1.25^2   δ < 1.25^3
Saxena et al. [23]        0.280   -       8.734      0.601      0.820        0.926
Eigen et al. [3]          0.190   -       7.156      0.692      0.899        0.967
DCNF-FCSP (pre-train)     0.236   0.101   7.421      0.613      0.858        0.949
DCNF-FCSP (fine-tune)     0.217   0.092   7.046      0.656      0.881        0.958

3.3.1 NYU v2 data 

In Table 5, we report the results compared to several popular
state-of-the-art methods on the NYU v2 dataset. As can be observed, our
method outperforms classic methods like Make3D [23] and DepthTransfer [7]
by large margins. Most notably, our results are significantly better than
those of Ladicky et al., which jointly exploits depth estimation and
semantic labelling. Compared to the recent work of Eigen et al. [3], our
method also exhibits better performance in terms of all metrics. Note
that, to overcome overfitting, they [3] have to collect millions of
additional labelled images to train their model. In contrast, we only use
the standard training set (795 images) without any extra data, yet we
achieve better performance. Fig. 7 illustrates some qualitative
evaluations of our method compared against Eigen et al. [3] (we download
the predictions of [3] from the authors' website). Compared to the coarse
predictions of [3], our method yields better visualizations with sharper
transitions, aligning to local details. To better illustrate how our
predictions deviate from the ground-truth depths, we plot the absolute
depth error maps and the pixel-wise error histograms in Fig. 10.
Specifically, the absolute error maps are shown in meters, with the color
bar shown in the last row. For the error histogram plot, the horizontal
axis shows the prediction error in meters (quantized into 20 bins), and
the vertical axis shows the percentage of pixels in each bin. As we can
see, our predictions are mostly well aligned to the ground-truth depth
maps.


3.3.2 Make3D data 

We show the compared results on the Make3D dataset in Table 6. We can see
that our DCNF-FCSP model with the whole network training ranks first in
overall performance, outperforming the compared methods by large margins.
Note that the C2 errors of [24] are reported with an ad-hoc
post-processing step, which trains a classifier to label sky pixels and
sets the corresponding regions to the maximum depth. In contrast, we do
not employ any of those heuristics to refine our results, yet we achieve
better results in terms of relative error. Compared to the results of the
DCNF-FCSP model using the smaller net in Table 4, we get better C1 error
but degraded C2 error. This can be explained by the limitation of this
dataset that the depths of all far-away objects are set to one maximum
depth value. Some examples of qualitative evaluations are shown in
Fig. 8. By jointly learning the unary and pairwise potentials, our
DCNF-FCSP model produces predictions that capture local details well. The
absolute depth error maps and the pixel-wise error histograms are shown
in the right part of Fig. 10. As can be observed, our predictions are
mostly well aligned to the ground-truth depth maps, with most of the
prediction errors occurring in boundary regions that show extremely large
depth jumps. In these cases, our predictions exhibit relatively mild
depth transitions.

3.3.3 KITTI data 

We further perform depth estimation on the KITTI dataset [36], which
consists of videos taken from a driving vehicle with depths captured by a
LiDAR sensor. We use the same test set, i.e., 697 images from 28 scenes,
as provided by Eigen et al. [3]. As for the training set, we use the same
700 images that Eigen et al. [3] used to train the method of Saxena et
al. [23]. Since the ground-truth depths of the KITTI dataset are
scattered at irregularly spaced points, which cover only ~5% of the
pixels of each image, we extract the ground-truth depth closest to each
superpixel centroid as the superpixel depth label. We then construct our
CRF graph only on those superpixels that have ground-truth labels. The
compared results are presented in Table 7. In summary, Eigen et al. [3]
have achieved slightly better performance on this dataset by leveraging
large amounts of training data (in the millions). This can be explained
by the fact that the highly sparse ground-truth depth maps lessen the
benefits of the pairwise term in our model. Fig. 9 shows some prediction
examples.
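The superpixel labelling step for sparse LiDAR ground truth can be sketched as below. Several details are assumptions rather than statements from the text: missing depths are encoded as NaN, the candidate points for a superpixel are restricted to valid depths inside that superpixel, and superpixels without any valid point receive NaN and would be left out of the CRF graph. The function name is hypothetical.

```python
import numpy as np

def superpixel_depth_labels(labels, sparse_depth, n):
    """Pick, for each superpixel, the valid depth closest to its centroid.

    labels:       (H, W) integer superpixel ids in [0, n)
    sparse_depth: (H, W) depths, NaN where no LiDAR return exists
    Returns an array of n labels, NaN for superpixels with no valid depth.
    """
    out = np.full(n, np.nan)
    rows, cols = np.indices(labels.shape)
    valid = ~np.isnan(sparse_depth)
    for t in range(n):
        region = labels == t
        if not region.any():
            continue
        cy, cx = rows[region].mean(), cols[region].mean()  # superpixel centroid
        pts = region & valid
        if not pts.any():
            continue  # no ground truth here: leave unlabelled
        d2 = (rows[pts] - cy) ** 2 + (cols[pts] - cx) ** 2
        out[t] = sparse_depth[pts][np.argmin(d2)]
    return out
```

Filtering with `~np.isnan(result)` afterwards gives the subset of superpixels on which the graph would be built under these assumptions.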

4 Conclusion 

We have presented a deep convolutional neural field model for depth
estimation from a single image. The proposed method combines the strength
of deep CNN and continuous CRF in a unified CNN framework. We show that
the log-likelihood optimization in our method can be directly solved
using back propagation without any approximations required. Predicting
the depths of a new image by solving the MAP inference can be efficiently
performed as closed-form solutions exist. We further propose an improved
model that is based on fully convolutional networks and a novel
superpixel pooling method. We experimentally demonstrate that it is
equally effective while bringing an orders-of-magnitude training speedup,
which enables the use of deeper networks for better performance.
Experimental results demonstrate that the proposed method outperforms
state-of-the-art methods on both indoor and outdoor scene datasets.

The main limitation of our method is that it does not exploit any
geometric cues, which can be explored in future work. Given the general
learning framework of our method, it can also be applied to other vision
applications with minimal modification, e.g., image denoising and
deblurring. Another potential direction is to apply our depth estimation
models to benefit other vision tasks, e.g., semantic segmentation and
object detection, on traditional vision datasets where ground-truth
depths are not available.